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Abstract. An efficient direct solver for volume integral equations with 0{N) complexity for a broad range of 
problems is presented. The solver relies on hierarchical compression of the discretized integral operator, and exploits that 
off-diagonal blocks of certain dense matrices have numerically low rank. Technically, the solver is inspired by previously 
developed direct solvers for integral equations based on "recursive skeletonization" and "Hierarchically Semi-Separable" 
(HSS) matrices, but it improves on the asymptotic complexity of existing solvers by incorporating an additional level of 
compression. The resulting solver has optimal 0{N) complexity for all stages of the computation, as demonstrated by both 
theoretical analysis and numerical examples. The computational examples further display good practical performance in 
terms of both speed and memory usage. In particular, it is demonstrated that even problems involving lO'^ unknowns can 
be solved to precision 10"^" using a simple Matlab implementation of the algorithm executed on a single core. 



1. Introduction. Many boundary value problems from classical physics, when cast as boundary 
or volume integral equations, take the form 

(1.1) a{x)aix) + J^IC{x,y)aiy)dS{y) = fix), Vx G T, 

where F is a domain in either or M.^ (either a boundary or a volume) and JC{x,y) is a kernel function 
derived from the fundamental solution associated with the relevant elliptic PDE (e.g. the Laplace or 
Helmholtz equation, the Stokes equations, etc.). The kernel function is typically singular near the 
diagonal (as y approaches x) but is otherwise smooth. 



Discretizing the integral in (1.1) using, e.g., the Nystrom method results in a linear system of 
the form 



(1.2) 



A<J = f, 



where A is a dense N x N matrix. For many problems it has been known since the 1980's how to rapidly 
evaluate the matrix-vector product a Aa using 0{N) operations, see, e.g., 
multiplication techniques can be coupled with an iterative solver (e.g. GMRES 



21 



34 



. Such matrix-vector 
or Bi-CGSTAB [37]) 

to attain solvers for ( 1.1 ) that can be very effective whenever convergence is rapid. More recently, direct 



solvers with linear complexity have been developed for ( |1.2| for the case when F is a contour in the 
plane [31] . When the method of ^3T\ is applied to a volume domain in the plane, a reasonably efficient 
direct solver with asymptotic complexity 0(iV^/^) results [I7][l8j|26]. This paper describes how the 
algorithm of [3l] can be reworked to construct a direct solver for volume integral equations in two 
dimensions that has optimal 0{N) complexity and high practical efficiency (even at high accuracies 
such as 10~^°). We believe that this solver will also (with some additional work) be capable of directly 
solving boundary integral equations for problems in 3D in 0{N) or 0(iVlog A^) operations, and will be 
very well suited for implementation on multicore and parallel computers. 

1.1. Previous Work. The key observation enabling the construction of 0{N) algorithms for 
solving linear systems arising from the discretization of integral equations is that the off-diagonal blocks 
of the coefficient matrix can be very well approximated by matrices of low numerical rank. In this 
section, we briefly describe some key results. 

Optimal complexity iterative solvers for integral equations. By combining the observa- 
tion that off-diagonal blocks of the matrix have low rank with a hierarchical partitioning of the physical 
domain into a tree-structure, algorithms of 0{N) or 0{N log N) complexity for matrix- vector multiplica- 
tion were developed in the 1980's. Perhaps, the most prominent is the Fast Multiple Method 21 22p3 



but the Panel Clustering 23 and Barnes-Hut [2] methods are also well known. When a fast algorithm 
for the matrix- vector multiplication such as the FMM is coupled with an iterative method such as, e.g.. 



GMRES 34 or Bi-CGSTAB 37 , the result is a solver for the linear system arising upon discretization 
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of (1.1) with overall complexity 0{M N), where M is the number of steps required by the iterative 
solver. In many situations, M is independent of N and convergence can be very rapid. 

The first FMMs constructed were custom designed for specific elliptic equations (e.g. Laplace, 
Helmholtz, Stokes), but it was later realized that kernel-independent methods that work for broad 



classes of problems could be developed 19 42]. They directly inspired the work described in this paper, 
and the new direct solvers can also be said to be "kernel-independent" in the sense that the same 
algorithm, and the same code, can be applied to several different types of physical problems such as 
electro-statics. Stokes flows, and low frequency scattering. 

While iterative solvers accelerated by fast methods for matrix-vector multiply can be very effective 
in many contexts, their performance is held hostage to the convergence rate of the iteration. If the 
equation is not well-conditioned, the complexity of an iterative solve may increase. Pre-conditioners 
can sometimes be constructed to accelerate convergence, but these tend to be quite problem-specific 
and do not readily lend themselves to the construction of general purpose codes. Examples of ill- 
conditioned problems that can be challenging to iterative methods include Fredholm equations of the 
first kind, elasticity problems on thin domains, and scattering problems near resonances. 

Direct solvers for integral equations. In the last ten years, a number of efficient direct solvers 
for linear systems associated with integral equations have been constructed. These solvers entirely side- 
step the challenges related to convergence speed of iterative solvers. They can also lead to dramatic 
improvements in speed, in particular in situations where several of equations with the same coefficient 
matrix but different right-hand sides need to be solved. 

The work presented here draws heavily on [31| , (based on on [36| ), which describes a direct solver 
that was originally developed for boundary integral equations defined on curves in the plane, and has 
optimal 0{N) complexity for this case. The observation that this algorithm can also be applied to 



volume integral equation s in the plane, and to boundary integral equations in 3D was made in 20 



and later elaborated in 17 . For these cases, the direct solver requires an 
approximation to the inverse of the matrix, and 0{N log N) flops for the "solve stage" once the inverse 
has been constructed. Similar work was done in [26| , where it is also demonstrated that the direct 
solver can, from a practical point of view, be implemented using standard direct solvers for large 
sparse matrices. This improves stability, and greatly simplifles the practical implementation due to the 
availability of standard packages such as UMFPACK. 

The direct solver of (sTI relies on the fact that the matrices arising form the discretization of integral 
equations can be efflciently represented in a data-sparse format often referred to as "Hierarchically Semi- 
Separable (HSS)" matrices. This matrix format was also explored in [9{[Tl], with a more recent efficient 



version presented in 40 . This work computes ULV and Cholesky factorizations of HSS matrices; if 



these techniques were applied to volume integral equations in 2D, the complexity would be 0{N^^'^), 



in complete agreement with 20 and 17 . The paper [39] presents a general complexity study of HSS 
algorithms, under different rank growth patterns; it presents an optimal-complexity HSS recompression 
method which we adapt to our setting as a part of the overall algorithm. 

An important class of related algorithms is TL and 'H^-matrix methods of Hackbusch and co-workers 
(see [s'i'o'i'T] for surveys). These techniques are based on variations of the cross approximation method 
for low-rank compression, and have been applied both to integral equations and sparse systems derived 
from PDEs. The matrix factorization for two and three-dimensional problems algorithms are formulated 
recursively, and a full set of compressed operations for lower-dimensional problems needs to be available. 
In [4|[5], algorithms for TL^ matrix arithmetics are described; Observed behavior for integral equation 
operators on the cube and on the sphere in chapter 10 of ItI is 0(A^log'*A^) for matrix compression, 
0(iV log^ N) for inversion and 0(A^log^ N) for solve time and memory use. 

Direct solvers for sparse systems. Our direct solver is conceptually related to direct solvers 



for sparse system matrices such as the classical nested dissection and multifrontal methods 15 16 27 
These solvers do not have optimal complexity (they typically require 0{N^/'^) for the factorization 
stage in 2D, and O(iV^) in 3D), but are nevertheless popular (especially in 2D) due in part to their 
robustness, and in part to the unrivaled speed that can be attained for problems involving multiple right 
hand sides. Very recently, it has been demonstrated that by exploiting structured matrix algebra (such 
as, e.g., "H-matrices, or HSS matrices), to manipulate the dense matrices that arise due to fill-in, direct 
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solvers of linear or close to linear complexity can be constructed 17 ,28 30 35 41 . The direct solver 
described in this paper is conceptually similar to these algorithms in that they all rely on hierarchical 
domain decompositions, and efficient representations of operators that live on the interfaces between 
sub-domains. 

1.2. Overview of new results. We present a direct solver that achieves optimal 0{N) complexity 
for two-dimensional systems derived from integral equations with non-oscillatory kernels or kernels in 
low-frequency mode. Similarly to other HSS or H-matrix methods, we rely on the fact that the system 
is derived from an integral equation only weakly: if the kernel is of a different nature, the scalability of 
the solver may deteriorate, but it can still perform accurate calculations. 

The main features of our solver include: 

• Observed 0{N) complexity both in time and storage for all stages of the computation (in 
particular, for both the "build" and the "solve" stages). 

• The algorithm supports high accuracy (up to 10"^" — 10"^^ is practical), while maintaing 
reasonable efficiency both in time and memory. 

• The algorithm can take direct advantage of translation invariance and symmetry of underlying 
kernels, achieving considerable speedup and reduction of memory cost. 

The main aspects of the algorithm that allow us to achieve high performance are: 

• Two levels of hierarchical structures are used. The matrix A as whole is represented in the HSS 
format, and certain blocks within the HSS structure are themselves represented in the HSS 
format. 

• Direct construction of the inverse: unlike many previous algorithms, we directly build a com- 
pressed representation of the inverse, rather than compressing the matrix itself and then in- 
verting. 

• Our direct solver needs only a subset of a full set of HSS matrix arithmetics; in particular, the 
relatively expensive matrix-matrix multiplication is never used. 

We achieve significant gains in speed and memory efficiency with respect to the existent 0{N'^^'^) 
approach. For non-oscillatory kernels, our algorithm outperforms the 0{N^^^) algorithm around N ^ 
10^. Sizes of up to iV ~ 10^ are practical on a desktop gaining one order of magnitude in inverse 
compression time and storage. For example, for non-translation-invariant kernels, a problem of size 
= 3 X 10^ and target accuracy e = lO^^*' takes 1 hour and ^ 50GB to invert. Each solve takes 10 
seconds. For translation-invariant kernels such as Laplace, a problem of size N = 1 x 10^ and target 
accuracy e = 10^^° takes half an hour and 5 GB to invert, with 20-second solves. Reducing target 
accuracy to e = 10"^, inversion costs for the latter problem go down to 5 minutes and 1GB of storage, 
and each solve takes 10 seconds. 

Our accelerated approach to build the HSS binary tree also yields a fast 0{N) matrix compression 
algorithm. As is noted in |26 , given that matrix- vector applies are orders of magnitude faster than 
one round of FMM, this algorithm would be preferable to an interative method coupled with FMM for 
problems that require more than a few iterations. To give an example, for N — 10^ and e = 10"^*^, 
matrix compression takes 5 minutes and 1GB of storage, and each matrix apply takes less than 10 
seconds. 

Finally, we are able to apply our method with some minor modifications to oscillatory kernels in low 
frequency mode, and apply this to the solution of the corresponding 2D volume scattering problems. 
Although the costs are considerably higher, we still observe optimal scaling and similar performance 
gains. 

2. Background. Our approach builds on the fast direct solver in fTs!, which consists of separate 
hierarchical compression, inversion and inverse apply algorithms, all of which achieve linear complexity 
when applied to integral operators on one dimensional curves. These algorithms can be applied to 
volume integral equations in 2D and to boundary integral equations defined on surfaces in 3D but 
compression and inversion then have 0{N'^/'^) complexity, while application of the inverse is 0{N log N). 
The super-linear complexity renders the technique impractical for large problems, it is in practice useful 



for small and moderate size problems, see 17 26 



In this section we review the algorithms for compression and inversion described in 1 18 , as these 
are needed to describe our algorithms, and discuss their limitations for problems on 2D domains. 
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2.1. Notation and Preliminaries. We view an x iV matrix A as a kernel function K — K{p, q) 
evaluated at pairs of sample points. As it typically comes from an integral formulation of an elliptic 
PDE, aside from a low-rank block structure we also expect Green's identities to hold for K; this however, 
is not a fundamental limitation of our algorithm: the use of Green's identities is restricted to one small 
part of the algorithm (Section 2.2 ) relying on equivalent density representation and can be replaced by 
any other technique of similar nature. 

We use matlab-like notation A{I, J) where / and J are ordered sets of indices to denote submatrices 
of a matrix A. Again, following matlab conventions, A(:, J) and A{I, :) indicate blocks of columns and 
rows, respectively. 

An essential building block for our algorithm is the interpolative low-rank decomposition. This 
decomposition factors an m x n matrix A into a narrower skeleton matrix A^^ = A{:, P^) of size m x k, 
consisting of a subset of columns of A indexed by the set of indices P^, and the interpolation matrix T 
of size kx {n — k), expressing the remaining columns of A as linear combinations of columns of A^^: The 
set of indices is called the column skeleton of A. If n^*^ is the permutation matrix placing entries 
with indices from P^ first, 



(2.1) 



sk 



E 



where ||-E||2 ~ o'k+i vanishes as we increase k, and R — [IkxkT]Il^^ is a downsampling interpolation 
matrix. We denote this compression operation by [T, P'^] — ID{A,e), where e controls the norm of E. 
To obtain a similar compression for rows, we apply the same operation to A'^; in this case, we obtain 



a factorization A = (Hf^)'^ 
matrix. 



Ikxk 



sk\T 



Af + Er, where L = (Uf) 



Ikxk 



is an upsampling interpolation 



2.2. Constructing hierarchically semi-separable matrices. We assume that the domain of 
interest is contained in a rectangle fl, with a regular grid of samples (if the input data is given in a 
different representation, we resample it first). A quadtree is constructed by recursively subdividing O 
into cells (boxes Bi) by bisection, corresponding to the nodes of a binary tree T. The two children of a 
box are denoted ci{i) and C2(i). Subdivision can be done adaptively without significant changes to the 
algorithm, but to simplify the exposition we focus on the uniform refinement case. Let li denote the 
index vector marking all discretization points in box B^, and let C denote a list of the leaf boxes. Then 
{Ii}ii£C forms a disjoint partition of the full index set 



(2.2) 



{1, 2, 3, ...,7V}= |J/„ 



Let mi denote the number of points in Bi. The partition (2.2) corresponds to a blocking 



(2.3) 



i e C, 



of the linear system ( 1.2 1, where Aij = A{Ii,Ij), and the vectors a and / are partitioned accordingly. For 



a linear system such as ( 1.2 ) arising from the discretization of an integral equation with a smooth kernel, 



the off-diagonal blocks of (2.3) typically have low numerical rank. Such matrices can be represented in 



an efficient data-sparse format called hierarchically semi-separable (HSS). In order to rigorously describe 
this format, we first define the concept of a block-separable matrix. 

Definition 2.1 (Block-separable Matrices). We say A is block-separable if there exist matrices 
{Li, Ri}i^C such that each off-diagonal block Aij in (2.3) admits the factorization 



(2.4) 



A, 



Li Mij Rj 

miXki kiXkj kjXrrij 



where the block ranks ki satisfy ki < m^. 

In order to construct the matrices Li and Ri in (2.4) it is helpful to introduce block rows and block 
columns of A: For 



G £, we define the ith off-diagonal block row of A as 
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Aih,I\I,) 



[Ai^i . . . Ai_i^iAi_i+i . . . Ai^p]. The jth off-diagonal block column Aj°' is defined analogously. Given a 
prescribed accuracy e, we denote by and kf the e-ranks of A^""" and A™', respectively. 

Now that we have defined A™™ and we use them to obtain the factorization (2.4) as follows: 
For each i G C, form interpolative decompositions of and 

(2.5) ^r"'= A(/r\:) and A^' = A(/r\ :) R, , 

miXkikiX{N-mi) {N-mi)xkiki><.mi 

where the index vectors I^^^ and If^^ are the row-skeleton and column- skeleton of block i, respectively. 
Note that the columns of Li are a column basis for A^°"' , and the rows of Ri a row basis of Setting 



we then find that (2.4) necessarily holds. Observe that each matrix Mij is a submatrix of ^. 



Set Di ~ Ai^i. This yields a block factorization for A: 
(2.6) v4 = D'^ + L'^A'^-^R^ 

where D'', L"*, and i?'^ are the block diagonal matrices whose diagonal blocks are given by {Di}i,zc, 
{Li\iec^ and respectively. The matrix A'''^^ is the submatrix oi A corresponding to the union 

of skeleton points, with diagonal blocks zeroed out and off-diagonal blocks Mij. 

Remark 2.2. Row and column skeleton sets need not coincide, although for all purposes, we will 
assume they are augmented so that they are the same size (so Di blocks are square). If the system 
matrix is symmetric, as is the case for all matrices considered in this paper, these sets are indeed the 
same and further Ri — Lf . For this reason, as well as for the sake of simplicity, we make no further 
distinction between them unless it is necessary. 

Hierarchical compression of A. The key property that allows dense matrix operations to be 
performed with less that O(iV^) complexity is that the low-rank structure in definition 2.1 can be 



exploited recursively in the sense that the matrix A''"^ in (2.6) itself is block-separable. To be precise, 
we re-partition the matrix A'^^^ by merging 2x2 sets of blocks to form new larger blocks. Each larger 
block is associated with a box i on level d — 1 corresponding to the index vector lf^(^i-^ LI -f^^j-j) (the 
new index vector holds = k^^ + k^^ nodes). The resulting matrix with larger blocks is then itself 
block-separable and admits a factorization, cf . Figure |2.1[ 



(2.7) A^D'^ + L'^{D'^-^ + L'^-^ A'^-^ R'^-^)R'^ 




+ 



ffl 



Fig. 2.1. Two levels of block- separable compression: 
off-diagonal interactions are further compressed. 



blocks of M corresponding to children are merged and then 



We say that A is hierarchically semiseparable (HSS) if the process of reblocking and recompression 
can be continued through all levels of the tree. In other words, we assume that A^ — + L^A^^^R^ 
for ^ = d . . . 1, or, more explicitly, 

(2.8) A'^ = 0"^ + L'^{D'^-^ + L-^-^iD^-^ + ...{D^ + L^D°R^) . . .)R^-^)R^ 
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with A — A'^, A''^ — , and , and block-diagonal, with blocks in matrices with index £ 
corresponding to boxes at level £. For non-leaf boxes, blocks account for "sibling interactions," in 
other words interactions between the children ci(i) and C2{i) of Bi, 



(2.9) 



M, 



M, 



ci(i),C2(i) 



C2(i),ci(i) 



We call (2.8) the telescoping factorization of A. The matrices under consideration in this manuscript 



are (like most matrices arising from the discretization of integral operators) all HSS. 

Remark 2.3. In this paper we use the term hierarchically semi-separable (HSS) to conform with 
standard use in the literature, see J^- ll 'J^. In \1S\\2()[\31^ the term hierarchically block-separable 
(HBS) is alternatively used to refer to this hierarchical version of block- separability, consistent with 
definition \2.1\ 

Using equivalent densities to accelerate compression. A matrix is block-separable as long 
as all sub-matrices A] ™ a nd Af^"^ are rank-deficient. However, these matrices are large, so directly 
computing the IDs in (2.5) is expensive (0(7V^) cost 39|). In this section, we describe how to exploit 



the fact that the matrix to be compressed is associated with an elliptic PDE to reduce the asymptotic 
cost. Related techniques were previously described in |42j and [31] . 

Let Bi denote a leaf box with associated index vector li . We will describe the accelerated technique 
for constructing a matrix Li and an index vector If^"^ C U such that 



(2.10) 



A'°'" ^L,A{Il^^, 



holds to high precision. (The technique for finding Ri and li such that (2.5) holds is analogous.) 



For concreteness, suppose temporarily that the kernel K, is the fundamental solution of the Laplace 
equation, JC{x, y) — ^ log \x — y\. The idea is to construct a small matrix A™* with the property that 



(2.11) 

In other words, the columns of Aj" 
small matrix A™'", 

(2.12) 



Ran (A™") C Ran(i™^). 

need to span the columns of A™"". Then compute an ID of the 



A{I 



rsk 



Now ( |2.11[ ) and ( |2.12[ ) together imply that ( |2.10[ ) holds. 

It remains to construct a small matrix whose columns span the range of A™*. To do this, 

suppose that v G Ran(AJ°^), so that for some vector q e M^^™' v ~ A™* q. Physically, this means that 
the values of v represent values of a harmonic function generated by sources q located outside the box 
Bi . We now know from potential theory that any harmonic function in Bi can be replicated by a source 
density on the boundary dBi. The discrete analog of this statement is that to very high precision, we 
can replicate the harmonic function in Bi by placing point charges in a thin layer of discretization nodes 
surrounding Bi (drawn as solid diamonds in Figure 2.2 'b)). Let {zjl^Li denote the locations of these 



points. The claim is then that v can be replicated by placing some "equivalent charges" at these points. 
In other words, we form Af" as the matrix of size x pi whose entries take the form IC{xr, Zj) for 
r e li, and j = 1, 2, . . . , pi. 

Remark 2.4. Figure 2.2 shows an example of how accelerated compression works. Figure \2^a) 



illustrates a domain Q. with a sub-domain Bi (the dotted box). Suppose that ip is a harmonic function 
on Bi. Then potential theory assures us that can be generated by sources on dBi, in other words 
(p{x) — IC{x,y) a{y) ds{y) for some density a. The discrete analog of this statement is that to high 
precision, the harmonic function if can be generated by placing point charges on the proxy points 
marked with solid diamonds in Figure \27^ b) . The practical consequence is that instead of factoring the 
big matrix A^™ which represents interactions between all target points in Bi (circles) and all source 
points (diamonds), it is enough to factor the small matrix Af"" representing interactions between target 
points (circles) and proxy points (solid diamonds). 

Remark 2.5. The width of the layer of proxy points depends on the accuracy requested. We found 
that for the Laplace kernel, a layer of width 1 leads to relative accuracy about 10^'^, and width 2 leads to 



(a) 



(b) 



Fig. 2.2. (a) A domain Q (solid) with a sub-domain Bi (dotted), (b) Target points in Bi are circles, source points 
are diamonds, and among the source points, the proxy points are solid. 



relative accuracy 10^^". For the Helmholtz kernel JC{x,y) ~ HQ^\K\x^y\), similar accuracy is typically 
observed, but in this case, thicker skeleton layers are recommended to avoid problems associated with 
resonances. (Recall that in classical potential theory, a solution to the Helmholtz equation may require 
both monopole and dipole charges to be placed on dBi.) 

2.3. HSS matrix-vector multiplication. To describe the process of computing the inverse of 
an HSS matrix in compressed form, it is convenient first to explain how matrix-vector products can be 
computed. The telescoping factorization (2.8) yields a fast matrix-vector apply algorithm evaluating 
u = Aa. The structure of this algorithm is similar to FMM (but simpler, as we do not treat the near- 
field separately, approximating all external interactions with a single set of coefficients). To emphasize 
the underlying physical intuition, we refer to values a at points as charges and to the values u we want 
to compute as potentials. On non-leaf boxes, we use notation (pl for the charges assigned to the skeleton 
points of the box, and u^^'^ for computed potentials. The vector (p^ (i/) is the concatenations of all 
charges (potentials) of the boxes at level I. At the finest level, we define (j)'^ :— a. 

Upward pass. The upward pass simply uses the rectangular block-diagonal matrices to compute 
(j)^: 4>^ — Each block R^^^ acts on the the subvector of charges corresponding to the children 

of Bi at level i. 

Downward pass. We compute a potential u,; for each box, due to all outside charges, starting 
from two boxes at level 1. For the top-level boxes Bi and B2, the values are computed directly, using 
the sibling interaction matrix forming D'^ as defined by (2.9), in other words = D^cj)^. 



For boxes at the level £ > 1, the outside field on the boxes is obtained as the sum of the fields 
interpolated to the boxes at level i using ("tall" rectangular block-diagonal) and contributions of 
the siblings through square block diagonal matrix D^: 



At the leaf level, the last step is to transfer to the sample points and add the field due to boxes themselves 
(self-interactions, stored in the diagonal blocks Ai^i of D'^), The actions of different transformations are 
summarized in the following computational flow diagram: 




D"=A" 



2.4. Computing the HSS form of A'^ 

can typically be inverted directly, yielding an HSS representation of A 



If A is non-singular, the telescoping factorization (2.8) 
^ , which can be applied efficiently 
using the algorithm described in Section 2.3 The inversion process is also best understood in terms of 
the variables introduced in Section [2731 

First, we derive a formula for inversion of matrices having single- level telescoping factorization 
Z = F + LMR. (The matrix Z on the first step coincides with A = A'', and F = D'^, but both Z and 
F are different from A^ and on subsequent steps). 
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We consider the system Za = /, and define (p = cf)'^ ^ — Ra, u — u'^ ^ — M(j). We then perform 
block-Gaussian elimination on the resulting block system: 



(2.13) 



F 


L 







a 




f 


-R 





/ 




u 










-/ 


M 




4> 








We form the auxiliary matrices E 




RF- 


-^L, and E + M. Then 




'F 0" 




a 




"[/ - LERF-'^ + LEG-'^ERF-'^]/' 


(2.14) 


E-'^ 




u 




[RF-^ - G-^ERF-^]f 




G 









ERF-^f 



which yields the inverse of Z by solving the block-diagonal system in the first line. Denoting R 
ERF-^, D = F-\I - LR) and L = F-^LE, we obtain: 



(2.15) 



Z~^ = D + L{E + M)-^R. 



We make a few observations: 

• If Z?, L and R are block diagonal, then so are E, R, L and D. This means that these matrices 
can be computed inexpensively via independent computations that are local to each box. 

• The factors in the inverse can be interpreted as follows: 

— E~^ — RF~^L can be viewed as a local solution operator "reduced" to the set of skeleton 
points for each box. It maps fields to charge densities on these sets. 

— E + M maps charge densities on the union of skeleton points to fields, adding diagonal 
Ea and off-diagonal Ma contributions. 

This inversion procedure can be applied recursively. To obtain a recursion formula, we define an 
auxihary matrix A*^ ^ A*"' + E^+^ , for I < d, and A'^ = A'^ ^ A; F^ ^ + E'^+^, and F'^ = D'^. Then 

'^R^, and its inverse by (2.15) satisfies 



A satisfies A 



(2.16) 



[A'-y 



d' + l'{e' + a' 



'r' = d' + l\a' 



A fine-to-coarse procedure for computing the blocks of the inverse immediately follows from (2.161: for 
each layer £, we first compute F^, using E^~^^ for the finer layer (zero for the finest). F^ determines D^, 
R^ and L^, and E^, to be used at the next layer. 

Algorithm [l] summarizes the HSS inversion algorithm; it takes as input a tree T with index sets 
defined for leaf nodes, and matrices Di, Ri and Li for each box Bi, and computes components Di, Ri 
and Li of the HSS form of A~^. This algorithm has complexity 0{N'^/'^) for a volume integral equation 
in 2D. It forms the starting point from which we derive an 0{N) algorithm in Section [sj 

The hierarchical structure this algorithm computes is not entirely the same as that of matrix A: 
L, R are not interpolation matrices, and D has nonzero diagonal blocks. However, it can be converted 



to standard HSS format if needed via the simple re- formatting algorithm in 17,18 



3. 0{N) Inverse Compression Algorithm. When applied to boundary integral equations (BIEs) 
in 2D, Algorithm 1 has optimal 0{N) asymptotic complexity for non-oscillatory kernels and a broad 
range of geometries. However, for volume integral equations in 2D, the typical complexity is 0{N^^'^). 

In this section, we first show why the algorithm has higher complexity for volume problems, and 
then develop a faster algorithm in two steps. First, we modify the inverse algorithm to expose the 
essential blocks of A~^, which can in turn be viewed as compressible operators acting on ID-like sets 
of points, and avoid the need to use non-low-rank matrix-matrix products in the algorithm. Then we 
demonstrate how these operators can be compressed using one-dimensional HSS structures. 



3.1. Efficient skeleton construction in 2D. 
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Algorithm 1 HSS matrix inversion 



for each box Bi in fine-to-coarse order do 
if Bi is a leaf then 

else 



E. 



ci(i) 



C2(i), 



end if 

if Bi is top-level then 

Fi — {Direct inversion at the top level} 
else 

Compute F^^ 

E, = {R,Fr^L,r^ 

Ri — EiRiF^ ^ 

A^Fr\l -L,R,) 

Li — F^ ^LiEi 
end if 
end for 



ID 



B; 



2D Bi 



Bi 



Fig. 3.1. Interaction ranks in ID and 2D. Source box Bi is recursively subdivided into well-separated sets, whose 
interaction with Bj is constant rank. This provides an upper bound for the overall interaction rank. 



Skeleton size scaling. Let us first consider for a non-oscillatory kernel (e.g. Laplace) the rank 
of the interaction between two neighboring boxes for dimensions D = 1,2. This example captures the 
essential behavior and size of skeleton sets in different dimensions. (Figure [3^^ 

Let Bi be a box at level £ of the tree in D dimensions, with rrii ^ ne — points. We estimate 

the e-rank ki of the interaction of Bi with a box of the same size adjacent to it. If sources and targets 
are well-separated (the distance between the set and target points is at least the set's diameter), the 
rank of their interaction matrix can be bounded by a constant p for a given e. We perform a recursive 
subdivision of our box Bi into well-separated sets until they contain p points of less. Then 



log2(nf/p)/-D 
s=Q 

For D = 1, we have log2{ni/p) intervals, and so ki ^ 0(log2(n£)). For D ^ 2, we subdivide 
log^{n£/p) times in the direction normal to the shared edge of the boxes, getting 2" well-separated 
boxes. Then, ki ^ 2'°S2("«/p)/2 = 0(nj^^) A simple calculation shows that this yields an estimate 0{N) 
for the complexity of the ID inversion algorithm, (the logarithmic rank growth with box size does not 
affect the complexity, because the number of boxes per level shrinks exponentially) and 0{N'^^^) in the 
two-dimensional case. 

Structure of skeleton sets in 2D. The HSS hierarchical compression procedure requires us to 
construct skeleton sets for boxes at all levels. 

The algorithm of Section |2.2| starts with constructing skeletons for leaf boxes, using interpolative 
decomposition of block rows with equivalent density acceleration. For non- leaf boxes at level i, index sets 
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A) Skeleton sets 



>1e-5 
i1e-6 



B) Log Interpolation Error || A^^ - A^^'T || / ||A^, 



"Ho 60 80 100 120 140 160 180 200 

Skeleton Points 



Fig. 3.2. A) Skeleton sets picked by the interpolative decomposition for different relative interpolation accuracies 
B) Log plot of relative interpolation error depending on skeleton set size 



are obtained by merging skeleton sets of children and performing another interpolative decomposition 
on the corresponding block row of to obtain the skeleton. 

This approach works for one-dimensional problems, but for two-dimensional ones applying inter- 
polative decomposition at all levels of the hierarchy is prohibitively expensive even with equivalent 
density acceleration: the complexity of each decomposition for a box Bi is proportional to kf n^J"^ in 
the two-dimensional case. As a consequence, the overall complexity cannot be lower than 0{N^^'^). 

Our algorithm for constructing the skeleton sets at all levels is based on the following crucial 
observation: It is always possible to find an accurate set of skeleton points for a box by searching 
exclusively within a thin layer of points along the boundary of the box. This observation can be justified 
using representation results from potential theory, cf. Section |2.2[ It has also been substantiated by 
extensive numerical experiments. Increasing target accuracy adds more points in deeper layers, but the 
depth never grows too large: for the kernels we have considered, the factorization selects one boundary 
layer for e ^ lO"'^, and two layers for e ^ 10"^*'. (Figure 3.2). 

This observation allows us to make two modifications to the skeleton selection algorithm. First, we 
restrict the set of points from which the skeletons are selected a priori to m boundary layers. Second, 
rather than selecting skeleton points for a parent box from the union of skeletons of child boxes using 
an expensive interpolative decomposition, we simply take all points in the boundary layers of the parent 
box. 

More specifically, li = ^ci(i) '-' -^C2{i) '^P^^^ -^i'' ' skeleton of Bi , consisting of all points of 
li within m layers of the boundary of Bi, and — Ii \ If'' (residual index set), consisting of points at 
the interface of two child boxes. To obtain the interpolation operator : I^^ If^ provided by the 
interpolative decomposition in the slower approach, we use a proxy set ZP''™^ — {zj}^'^-^ (as described 



in Section 2.2), and compute Ti from the following equation: 



Remark 3.1. . The set from which the skeleton set is picked has fixed width. This means that 
matrices acting on the skeleton set are compressible (in the HSS sense) in a manner analogous to 
boundary integral operators and admit linear complexity matrix algebra. These matrices in essence act 
like boundary-to-boundary operators on the box. 

3.2. Overview of the modified inversion algorithm. Construction of a compressed inverse 
of an HSS matrix described in Section [Z4| proceeds in two stages: first, the HSS form of the matrix 
A is constructed, followed by HSS inversion, which is performed without any additional compression. 
In our modified algorithm, significant changes are made to the second stage, including compression of 
all blocks. Not all blocks constructed at the first stage (compression of A) are needed for the inverse 
construction, so we reduce the first-stage algorithm to building the tree and interpolation operators 
Li,R, only. 

Examining Algorithm [T] we observe that all blocks formed for each box Bi involve the factors 
Ei,F^^ and the interpolation matrices Li,Ri. One first step to save computation is to compute 
{Ei,F^^} only. Matrix-vector multiplication for blocks D, L and R needed for the inverse matvec 
algorithm can be implemented as a sequence of matvecs for Li, Ri, Ei and F~^. 
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The main operations in the algorithm are the two dense block inversions in lines 10 — 11. Since Fi 
is of size — fcci(i) + ^C2(0 (merge of two skeleton sets) and E.^ is of size fcj, storage space of these 
blocks is 0{kf) = 0{ni) and inverting them costs 0{kf) — 0{n'J'^) floating point operations. This 
observation shows that it is impossible to obtain linear complexity for HSS inversion if we store and 
invert Ei and Fi blocks densely, or even to build the interpolation matrices that form Li and Ri. 

We first present a reformulation of the HSS inversion algorithm that partitions computation of Ei 
and F~^ into block operations, allowing us to compress blocks as low rank or one-dimensional HSS 
forms. As it is typically the case, it is essential to avoid explicit construction of the blocks that we want 
compressed, i.e. they are constructed in a compressed form from the start. 

Building and inverting Fi. For a non-leaf box i?^, Fi is a linear operator defined on li = 
-^ci{i) '-' - ^c'^(i ) ' ^^^^ '-'^ ^^"^ merge of skeleton points from its children. Using physical interpretation of 
Section [2T4I it maps charge distributions to fields on this set, adding contributions from local operators 



E^, (iu ^'C2(i) sibling interactions. We expect Fi to have a rank structure similar to that of K[Ii, li]. 

^ to be used in the inverse 
be the permutation 
/ 



ci (i) 

Aside from needing a compressed form and an efhcient matvec for F^ 



HSS matvec algorithm, we also use F^ to construct E,^ 
matrix that places skeleton points first. Matrices Ri = [l T"^] H^ and L 



Let H, 
= H. 



dn\T 



have a block 



form, with sub-blocks T!"^ and T!*', which we will show to be low-rank (Section 3.3.2 1. To construct Ei 
efhciently, we need an explicit partition of Fi into blocks matching blocks of Ri and Li, i.e. corresponding 
to skeleton index set 7*'^ and residual index set /^'"^ of B,;. We use the following notation for the blocks 
of 



n,i^Hr 



pr 



' p \ Tsk rsk] JT \T^^ T'^^W f 

where we use s to refer to the skeleton set of the parent and r to the "residual" set (the part of the 
union of the skeleton set of the children not retained in the parent). 

To represent F~^ we use these subblocks and perform block-Gaussian elimination. If $ = H^Fj^^^ 
Then: 



We note that, by eliminating residual points first, (/)|'^ is the inverse of the Schur complement matrix 
(Sl')-^ = [Fl^ - Fi^-'iFl^y^Fl^']-^, and that only {{Fl')'^ , Fl^' , Fi"-"- , {Sr)-^) are needed to 
compute the blocks for the inverse $i. The routine in our inverse compression algorithm that builds 
F~^ (Section 3.3.3) constructs these four blocks in compressed form. 



Building E^. Following the definition of Ei and the block structure of F^ we obtain the expres- 



sion: 







I 









As T^P^Tf" are low rank, the last three terms in this sum are low rank, too. Hence, Ei can be 
computed as a low rank update of (pf'' , the inverse of S'f'' 



We summarize the modified algorithm below; at this point it is a purely algebraic transformation, 
we explain in greater detail how the new structure can be used to compress various matrices. Next to 
each block we write between brackets the type of compression used in the 0{N) algorithm: [LR] (low 
rank) or [HSSID]. The output of the new form consists of: 

1. The blocks of F-^: {{F['')-\ F['~'' , Ff'-'' , (S'[")-i} 

2. The matrix Ei 

The entries of blocks of Fi are evaluated using the formulas in Algorithm [T] 

In the remaining part of this section we elaborate the details of efficient construction of all blocks 
using HSS and low-rank operations. 
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Algorithm 2 Modified HSS inversion algorithm 



1 


for each box Bi in fine-to-coarse order do 




2 


if Bi is a leaf then 




3 






4 


else 




5 


Compute blocks Ff'', Ff'^ from f,i, -Erofn) 






In compressing and inverting the matrix Fi for non-leaf boxes, 


we store sub-blocks of Fi and 




apply block inversion formulae as explained above: 




6 


{Fn-'^{F[ir.ir])-' 


[HSSID] 


7 


Compute blocks F^^" , Ff^"^ 


[LR] 


8 


[SIT' - [Ft - Fr^iFrr'FrT' 


[HSSID] 


9 


end if 
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[HSSID] 


11 


end for 





3.3. Compressed two-dimensional HSS inversion. We use two compressed formats for vari- 
ous linear operators in the algorithm; low-rank and one-dimensional HSS (dense-block HSS) described 
in Section [2] 

Operator Notation. To distinguish between linear operators compressed in different ways, we 
use different fonts: 

• X (in normal font) refers to an abstract linear operator, with no representation specified. 

• X refers to a dense-block HSS representation of X; 

• X refers to a low-rank representation of X 

The font used for interpolation operators like R{:,J) = [/ T] refers to the representation of T. 
Operations such as matrix- vector multiplies should be understood accordingly: e.g., Xv is evaluated 
using a low-rank factorization of X; if the rank is q, and vector size is n, the complexity is 0{qn). 
Similarly, Xv is an 0{n) application of a dense-block HSS matrix. 

We say that the algorithms operating on per-box matrices are fast if all operations involved have cost 
and storage proportional to the block size 0(7i^/^) or that times a logarithmic factor 0(rt^/^ log''(n)). 
Our algorithm includes three main fast subroutines. 

1. INTER.LOWRANK: Interpolation matrices Tj are built in low rank form. 

2. BUILD_Finv: Given dense-block HSS matrices {f^u^): '^c^C*)}: it computes {(J"f ")"!, Ff^^ Fp^ (5,")-^ 
in their respective compressed forms. 

3. BUILD_E: Given {(J"f ^)-i, F|^^ Fp^ {SrT^} and T^, if", it computes as a dense-block 
HSS matrix. 

3.3.1. Fast Arithmetic. There is a number of operations with low rank and HSS matrices which 
we must be able to perform efficiently: 

• Dense-block HSSID compression, inversion and matvec: These are the algorithms 



in 18 , which we have outlined in Section 2, and we denote the corresponding routines as 

HSSID Compress and HSSID Invert. 

Fast addition and manipulation of HSSID matrices: 

— HSSlD.Sum: Given two matrices A and B in HSS form, return an HSS form for C = 
A + B 

— HSSlD.SpHt: Given a matrix A defined on Ii U I2, it produces the diagonal blocks Ai 
and A2 in HSS form. 

— HSSlD_Merge: Given matrices Ai and A2, it concatenates them to produce the block 
diagonal HSS matrix A, and sorts its leaves accordingly. 

— Additional HSS compression routines: 

* LR_to_HSSlD: Convert a low-rank operator to dense-block HSS form. 

* HSSlD.Recompress: Using the algorithm by Xia ^39], we re-compress an HSS ID 
form to obtain optimal ranks. It is crucial to do so after performing fast arithmetic 
(e.g. a sum) of HSS matrices. 
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Remark 3.2. Matrix-matrix products. As we have mentioned before, a feature of our algorithm 
is that it avoids using the linear yet expensive matrix-matrix product algorithm for structured matrices. 

We arrange the computations so that all matrix-matrix products are between a dense-block HSS H 
of size k X k and a low rank matrix K = UV^ of rank q. Since Till requires only q 0(k) fast matvecs, 
these products may be computed in 0{qk) work. 

Remark 3.3. For every call to HSS ID Sum in the algorithms below, it should be assumed that it 
is followed by a recompression step (a call to HSS 1 D_Recompress) . Due to the theorem 5.3 of l3§^, 
this implies all dense-block HSS matrices presented are compact, that is, their blocks have a near-optimal 
size. 

Randomized interpolative decomposition RAND ID. The randomized sampling techniques 



in 24 29 38 speeding up the interpolative decomposition are critical to obtain the right complexity 
for the compression of low-rank operators. For an m x m dense-block HSS matrix A of rank q, the 

1/2 

matvec has complexity 0{m) time. Assuming that at level £ of the binary tree T, m — 0{n^ ) and 
q = 0{\og{ni)), the complexity of the randomized IDs is: 



0{mq^ + q^)^0{n]'^\og^{n,)) 

3.3.2. Interpolation Operators in low-rank form. Wc recall that interpolation operators Tj 
are built with the binary tree T using an interpolative decomposition, and that they are inputs to the 
dense-block HSS inverse algorithms. They encode interactions between the residual points indexed as 
/,[ and the exterior Af ^* as a linear combination of interactions with the skeleton points indexed by 
Zf'^. In other words, they are solutions of the linear equation: 



K[X^^''\Xf]Ti = A[Af"'*, a;"] 



The acceleration proposed in Section |2.2| implies that interaction with outside points may be rep- 
resented with a proxy Z^"^^ of charges right outside the box boundary. We employ as many layers as 
are kept for Jf*^, and remove points until | Z'''^"''-*' | = |Af'^| — ki. This yields an equation for T^: 



A[Zf°"^ Af^lT, = K[Zf°''\Xr] 



^j-^pioxy^ ^sfej encodes interactions between two close boundary layer curves, their distance being equal 
to the grid spacing h. It is invertible, ill-conditioned, and most importantly it has dense-block HSS 
structure. 

KyZ^™^^ ^Xl"] encodes interactions with the interfacial points. Except for the closest layers, the 
interface is well-separated from the proxy, and so it is easy to see (via a multipole argument, or numer- 



ically as in Figure 3.3) that this matrix is of logarithmic low rank: 




Fig. 3.3. Skeleton points are in black, residual points in blue and proxy points (diamonds) in green. We apply an 
ID with e = le — 10, and label subselected interface points also in green. 



As a result, Ti = K[ZY°^^ , X^^] ^K[ZY°^^ ,Xl^] is also a low rank operator. We describe a fast 
algorithm to compress this operator. We note that while the description above follows the case of boxes 
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in the plane, these observations should hold in general, as long as most residual points are well separated 
from the proxy. 



Algorithm 3 Algorithm INTER_LOWRANK 



Input: Box Bi information, index sets I?*^ and I^'^; Output: {Ui,Vi} such that T; — UiV^ 

(i) Compress and Invert HSS ID operator 
l^s^P =HSSlD_Compress(X, Zf™''^, Af ^ e) 
(^s^p)-i ^HSSlD_Invert(/C"-^P,e) 

(ii) Randomized interpolatory decomposition 
K7^p =HSSlD_Compress(A, Zf , A[^ e) 
[T^^p^jr^p] =RANDJD(/C''^P,e); 

The ID gives us a low-rank decomposition of K^^p (of e-rank q): 
U, = {JC'^P)-^K'-^P{:, r^P{l : q)); 
(:, J''^P) = [/ T'-^P] 



Remark 3.4. In general, zf^"^^ will have slightly more points than X^'' , making /C*~*'^ a rectan- 
gular matrix. A fast HSS least squares algorithm such as in 1^13 ^ 251 replaces the inversion and inverse 
apply in lines 3 and 8 of Algorithm^ with no impact in complexity. 

3.3.3. Compressed F~^. We define some auxiliary index sets for the merge of two children boxes: 
are skeleton points shared with child c(i) (on boundary layers) and are residual children's 



-J c(i) ^"^^ skeleton points shared with child c{i) (on boundary layers) and /J'^(j) 
skeleton points (in the middle interface). We define If^ = ^ ^^'^ ^ 



:,C2(i) ' 



jrs y jrs 
i,ci(i) i.C2{i) 



.«««»«««««« 



Fig. 3.4. Merge of two children's skeleton indices. 



We recall that first, matrix F is built and separated into blocks corresponding to skeleton 7*'^ and 
residual sets. These two sets can be seen as arranged along one-dimensional curves (boundary of 
the parent box and interface between siblings). We order /f'' in the direction of the interface, and If'' 
cyclically around the box boundary. 

Then = F(/[^ I^") and J"/'' = F{lf,lf) can be constructed in dense-block HSS form. 

The two off-diagonal blocks {Ff^'', F^^''} encode interaction between sets that are close only at few 
points, and can thus be compressed as low-rank operators. 

BUILD_Finv routine. To clarify the structure of the construction of F^^, in Algorithm 4 we 
show this routine in two ways. On the left, we indicate the original dense computation, and on which 
line of the reformulated algorithm in Section 2.5 it occurs. On the right, we indicate the set of fast 
operations that replaces it in the linear complexity algorithm. 

Applying F^^ to a vector. Once we have the four compressed blocks {{Fl^)^^ ,¥1^"^ , Fl^*' ('^r")"^}; 
the routine APPLY _Finv can be used to compute the fast product a — F^^u. This is done in a 
straightforward way splitting u into and using fast matvecs of the blocks of F^^ . 

3.3.4. Compressed 8. The last piece that is required is a routine that constructs f as a dense- 
block HSS matrix from in compressed form. For a box i3i, Ei is a linear operator defined on the 
skeleton set If'^ . Its inverse is obtained by applying to this set using interpolation operators: 



Fj ^ — LiF^ ^Ri 
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Algorithm 4 BUILD_Finv 



Input. Children's matrices £ci{i) and Sc2(i) in HSS form (each defined on skeletons 
Output. 

f. {J^l^)^^ in dense-block HSS form on the residual set /J''*. 

2. F|^'',F[^'' as low-rank operators. 

3. Inverse of Schur complement matrix (5^ in dense-block HSS form on If^ . 



(i) Obtain diagonal blocks of E by splitting into boundary and interface 



jpsk 



Cj (i) ' ^i.cj (i) ) 



pr 



Trs 1 



CT 



HSSID Split(£, 



(ii) Build diagonal blocks of F: merge diagonal blocks of E, compress off-diagonal blocks 



Line 5: 





-psk 




" 

psk^ 


+ 








Line 5: 










prs _ 


■prs 




" 

prs^ 


+ 









sk,off 



HSSlD_Compress(is:, /f*^); 



csk \ 

sk\ 



= HSSlD_Sum(£:f '"^^/Cf •°^^''); 



ff^HSSlD_Merge(£:-(,),f-(^)); 
jQrs,off ^ HSSlD_Compress(if,/[''); 
- HSSlD_Sum(£';"''^^/C,["'°^'^''); 



(iii) Inverse of 



Line 6: 



HSSlD_Invert(J7^); 



(iv) Low rank decompositions for Ff^'' and F^'*"* using Randomized IDs: 



Line 7: 

Fr<r-s J?\TTS Tsk] 



[T[,Jl] =RAND_ID(F/^'',e); 
[Tf, Jf] =RAND_ID(i^/■^^£); 



(v) Schur complement as a low rank perturbation of Ff 



-Pf = LR_to_HSSlD(-Ff^''(J'['')-iF[^ 

= HSSlD_Sum(J■/^■pf ); 
(^Srsyi ^ HSSlD_Invert(5,p); 



Line 8: 

(^S^syi ^ [F[I^'',I^''] - F'^''{F''')-^F'' 



Hence, the physical intuition again is that Ei has a rank structure similar to that of K[I^^ ,1^^]. 
In [8l [T2l[32] its inverse is called a reduced scattering matrix. From a purely algebraic point of view, we 
observe that f is a low-rank perturbation of S^^ . Hence, if S^^ has dense-block HSS structure, so does 
£. 



BUILD.E routine. Recalling the definition of Ei in Algorithm [2] 



^ R^F-'L, = [/ T^] 







I 









yupirs 



(t: 



dnsjT 



we observe the last three matrices are low-rank. Using the Schur complement formulae we obtain 
explicit factorizations for each one, recomprcss as a single low-rank matrix and then convert it to the 
dense-block HSS form using LR_to_HSSlD. 

We can perform the low-rank matrix products (of rank qi 
operations, since all of the products involved are fast {0{ki 
\U''\ = k,). 



in Algorithm [5] in 0{miqi) or 0{miqf) 
or 0(mi — kj), where = rrii and 



3.4. Inverse matrix-vector multiplication. Once the compressed form of the inverse is ob- 
tained, it can be efficiently applied to a right-hand side vectors with the algorithm of Section [2.3[ but 
using fast algorithms for our compressed representations of blocks. We present it here for completeness 
(Algorithm [6]) . 

We notice that application of Ri,Di and Li is substituted by fast matrix vector multiplication of 
Li,Ri , Ei and the blocks comprising Flj^^ . 
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All of these have complexity 0{ri'J'^) or 0(n^J'^ \og{ni)). 



Algorithm 5 BUILD_E 



Input. as blocks: (J"f'')-\ Ff^^ F[^^(5['')-l; L, and Ri in low-rank form: T7,J"P,Tf", and 

jdn^ jup ^ u^Pyup^ ^jdn^T ^ ijdnydn ^ Output. £, in HSS form (on the curve ) 

4: Recompress: Uf{y,^f = Y.l^,Ulp{y^^Y 
5: = LR_to_HSSlD(f/f , vf) 
6: = HSSlD_Sum((5,p)-\Xi) 
7: e, = HSSlD_Invert(£:-i) 



Algorithm 6 HSS inverse matrix-vector multiplication. 
Input 

1. The binary tree T including skeleton set indices and Ri,Li associated with boxes; 

2. HSS-compressed inverse: per-box blocks forming Fi and £i. 

3. the vector / of field values defined at source points. 

Output. The output is fj = A^^f, where is the compressed inverse. 
{Upward Pass, compute u"^} 
for each box Bi in fine-to-coarse order do 
2; 1: if Bi is a leaf then 

3: U = fiir) 

4: else 

up 
up 



6: 
7: 
8: 
9: 
10: 
11: 
12: 
13: 
14: 
15: 
16: 
17: 
18: 
19: 
20: 
21: 

22; 
23: 



end if 

Multiphcation by R, = E.RiFr^-. 
ip, = APPLY_Finv(M) 



up _ 



end for 

{Downward Pass: compute 4>f'^; the result A~^f on leaf Bi is 

^f^p = , < = " 

for each B^ in coarse-to-fine order do 

Multiphcation by 5, = F^^^[I - L,Ri] 

Define u as above 



K = U - _ 

Multiphcation by L, = F-^L,E,: pf" = L,£,(/)f" 
Add both contributions multiplying by the common factor F.[ 
if Bi is a leaf then 

cr(/f") = APPLY_Finv(j/f" 
else 



end if 



APPLY_Finv(t'; 



dn 



dn ^ 



24; end for 



4. Complexity Estimates. In this section we estimate the computational complexity of the 
dense-block and the compressed-block algorithms presented in Section[3j under a number of assumptions 
on the rank structure of blocks of A. 

We first define a framework to estimate work and storage for algorithms defined on a binary tree T, 
and use it to analyze all types of algorithms described in this paper. We consider both non-translation 
invariant (NTI), and translation-invariant kernels (TI) for which significant performance gains can be 
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obtained. 



4.1. Complexity of algorithms on binary tries. All our algorithms compute and store matrix 
blocks associated with boxes organized into a binary tree T. To produce complexity estimates for a 
given accuracy e, we introduce bounds for work We{ng,e) and storage M^(n^,e) at each level £ of the 
tree, where ne = 2~^N is the maximum number of points in a box at this level (we assume that the 
work per box on a given level has small variance). 

Lemma 4.1. Let ni — 2~^N, and d = \og2{N / n,nax) , o,nd exponents p,q > 0. Then if Wi{ni, e) 
has the form Cen^\og2{ne), the total work has complexity 



0{N) : < p < 1 

NTI: 2'''Wi{ni,e) = < 0(iVlog^+^ N) : p ^ 1 

OiNPlog^N) :p>l 



i=0 



TI: Wiine, s) = 0{NP log" N) 



For NTI algorithms, on each level £ we obtain an estimate by adding the bound for work per box 
for the 2^ boxes. The polynomial growth in or Mi is compensated by the fact that the number of 
boxes decreases exponentially going up the tree. 

If the rate of growth of Wi is slower than linear (p < 1), the overall complexity is linear. If grows 
linearly, we accumulate a logiV factor going up the tree. If the growth of Wg is superlinear {p > 1), 
the work performed on the top boxes dominates, and we obtain the same complexity as in Wi for the 
overall algorithm. 

In the TI case, work/storage on the top boxes dominates the calculation, since only one set of 
matrices needs to be computed and stored per level. Hence, the interpretation is simpler: the single- 
box bound for complexity at the top levels reflects the overall complexity. 

4.2. Assumptions on matrix structure. Complexity estimates for the work per box require 
assumptions on matrix structure. Let us restate some assumptions already made in Sections [2] and [3j 

1. Skeleton size scaling: The maximum size of skeleton sets for boxes at level £ grows as 

1 /2 

0{ng ). This determines the size of blocks within the HSS structure, and can be proved for 
non-oscillatory PDE kernels in 2D. 

2. Localization: Equivalent densities may be used to represent long range interactions to within 
any specified accuracy, cf. Section [2?2| 

3. Skeleton structure: The skeleton set for any box may be chosen from within a thin layer of 



points close to the boundary of the box, cf. Section 3.1 
4. Compressed block structure: Experimental evidence and physical intuition from scattering 
problems allows us to assume that the blocks of F and E discussed in Section [3] have one- 
dimensional HSS or low-rank structure, with logarithmic rank growth ( 0(log(n^)) ). 
These assumptions arise naturally in the context of solving integral equations with non-oscillatory PDE 
kernels in 2D. All assumptions excluding the last one are relevant for both dense and compressed block 
algorithms. The last one is needed only for the compressed-block algorithms. 

We note assumption [3] implies [1} being able to pick skeletons from a thin boundary layer determines 
how their sizes scale. We mention them separately to distinguish their roles on the design and complexity 
analysis of our algorithms: while [l] mainly impacts block sizes on the outer HSS structure, [3] is much 
more specific and refers to a priori knowledge of skeleton set structure which we exploit extensively in 
the compressed-block algorithm. 

4.3. Estimates. We analyze work and storage for the algorithms of Section [3j Since they all use 



the same set of fast operations (see Section 3.3.11, we can make unifying observations: 
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Work. 

1. Assumptions [l] and [3] imply that our fast subroutines perform operations with HSS blocks of 
size O(ny^). Further, Assumption [4] states that these behave like HSS matrices representing 
boundary integral operators, for which all one-dimensional HSS operations are known to be 

1 /2 

linear in matrix size. Thus, all HSSID operations are 0{n/ ), including matrix application. 

2. As indicated in Remark |3.2[ for an HSS matrix of size k x k, products of HSS and low-rank 
matrices require 0{kq) work. Assumption [4] implies all such products are O {nY"^ \og2{'ni)) . 
Low-rank matrix matrix-vector multiplication has the same complexity. 

3. Finally, both LR_to_HSSlD and RandJD involve interpolative decompositions of a matrix of 
size 0{n/ ) x 0{\og2{ne)), and, therefore have complexity 0{n/ log2(n^)). Products between 
low-rank matrices are also of this complexity. 

Storage. 

1. Again, since all HSS blocks behave as operators acting on one-dimensional box boundaries, 

1 /2 

storage is linear with respect to the number of nodes along the boundary of the box: 0{n/ ). 

2. A low-rank matrix of size m x n and rank q occupies 0{{m + n)q) space in storage. By 
Assumption |4j storage of low rank blocks (L, R and off-diagonal blocks of F) is ©(n^^ log2{ni)). 

Algorithms [3] [4] and [5l require only the operations listed above. We observe that all algorithms 
contain at least one 0(71/ ^ log2(nf)) operation. In terms of storage, INTER_LOWRANK and 
BUILD_Finv store both HSS and low-rank blocks (0(71^^ log2(n£))), and BUILD_E one HSS block 

1 /2 

{0{n/ )). Hence, the compressed-block interpolation operator build and inverse compression algorithms 
perform Wf^ — 0{ny^ log^in^)) work per box and require M^'^ = 0(71^^ \og2{ng)) storage for each 
set of matrices computed at level £. As a contrast, their dense-block counterparts have Wf^ — 0(n^^^) 
and Mf^ = 0{ne). 

We note that a more detailed complexity analysis may be performed to obtain constants for each 
subroutine, given the necessary experimental data about our kernel for a given accuracy e. The specific 
dependance of these constants on accuracy is briefly discussed and tested in Section |5.2[ 

We summarize the complexity estimates in the following proposition 

Proposition 4.2. 

Dense-Block Algorithms Let A be an N x N system matrix such that assumptions 1-3 hold. 
Then, the dense-block tree build and inverse compression algorithms perform 0{N^^^) work. For NTI 
kernels, storage requirements and matrix apply are both 0{N log N). For TI kernels, storage is 0{N), 
and matrix apply is 0{N log N) . 

NTI Compressed-Block Algorithms Let A be an N x N , non translation invariant system 
matrix such that assumptions 1-4 hold. Then compressed-block tree build, inverse compression and HSS 
apply all perform 0{N) work, and require 0{N) storage. 

TI Compressed-Block Algorithms Let A be an N x N, translation invariant system matrix 
such that assumptions 1-4 hold. Then compressed-block tree build, inverse compression and HSS apply 
all perform 0{N) work, and require 0{N) storage. In fact, inverse compression work and storage are 
sublinear: 0{N^/'^logl N) and 0{N^/'^ log2 N), respectively. 



The limitations of dense-block algorithms now become clear. With notation as in Lemma 4.1 
dense-block algebra corresponds to {p,q) — (3/2,0) for work and {p,q) = (1,0) for storage, which 
precludes overall linear complexity. For the compressed block algorithms, on the other hand, we have 
{P:Q) = (1/2,2) for work and {p,q) = (1/2, 1) for storage, which does yield linear complexity. 

Remark 4.3. As we have previously observed, the compressed-block algorithm may be generalized 
to system matrices with other rank growth and behavior. More specifically, we observe that a sufficient 
condition to maintain optimal complexity could be that the outer HSS structure ranks grow as 0{n^), for 
p < 1, and that the most expensive operations such as the randomized interpolative decomposition also 
remain being sublinear. This would require all low-rank blocks to be of rank qg ^ jj^™"{i/3Xp i)/2p}^ 

Remark 4.4. In all our practical implementations of the compressed-block algorithms, we perform 
dense computations for blocks up to a fixed threshhold skeleton size fccut; after which we switch to the 
fast routines. This may then be tuned as a parameter to further speed up these algorithms, and a 
straight-forward computation shows it does not alter their computational complexity. 
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5. Numerical Results. In this section, we describe a series of numerical experiments that test our 
inverse compression algorithm on both translation-invariant (TI) and non-translation-invariant (NTI) 
kernels. 

All results are obtained by running a Matlab implementation of our algorithm are obtained on 
server nodes with Intel Xeon X5650, 2.67 GHz processor, with no parallelization. 

General formulation. As we noted in the introduction, a considerable number of physical prob- 
lems involve solving one or a system of integral equations of the form 

A[<j]{:x) = a(x)<7{x) + f b{x)ICi\\x-y\\)ciy)aiy)dy^ fix), 
Jn 

where a, b and c are given smooth functions, and /C(r) is related to a free-space Green's function. 
Then, given {xi}^-^ e fi = [—1, 1]^ points on a regular grid with spacing h, we can perform a Nystrom 
discretization, obtaining a linear system Aa = /, with A an N x N matrix with entries: 



Aij = a{xi)S,^j + h'^b{xt))C{\\xi - Xj\\)c{xj) 

5.1. High accuracy: performance and scaling. Most of our tests are done for a target accuracy 
of e = 10~^° (referring to the local truncation error of all routines above); this is the "stress test" for our 
algorithm, as high accuracy requires larger skeletons. We measure the wall-clock timings and memory 
usage for the following parts of the solution process: 

1. building the tree and interpolation operators; 

2. inverse construction and compression; 

3. inverse matrix- vector multiplication (solve, timings only). 

We compare against the dense-block HSS inversion algorithm. We take /C(r) to be the 2D Laplace 
free-space Green's function: JC{r) = ^ log(r) and a = 1. 

• li b = c, A is symmetric, which leads to some computational savings in the HSS algorithms. 

• The case 6 ^ 1 is our example of a non translation invariant (NTI) kernel. Such an equation 
appears in forward scattering problems (zero or low- frequency Lippman-Schwinger equation). 

• The case 6 = 1 is one of our examples of translation invariant (TI) kernel. We recall that 
significant computational savings may be achieved in this case, given that we only build one 
set of matrix blocks per level. 

We note that these results are typical for non-oscillatory Green's function kernels: we have per- 
formed the same tests for the 3D Laplace single layer potential and the 2D and 3D Yukawa Green's 
functions, and found behavior analogous to the one described below. 

5.1.1. Non-translation-invariant kernel. We take b{x) to be a smooth function with moderate 
variation: 



b{x) = 1 + 0.5e" 



-(xi-0.3)^-(2:2-0.6)^ 



Although this may seem like a simplistic choice, it does not matter much for computational perfor- 
mance unless the problem is under-resolved or 6 is on a large subdomain. 

For the NTI case ranks of blocks in dense-block HSS matrices and ranks of low-rank blocks vary 
both by level and inside each level. However, we observe that the rank growth for these blocks matches 
our complexity assumptions. 

We set leaf box size at Umax — 7^, and for problem sizes N — 784 to iV = 3211264, we compare 
total inversion time (tree build -|- inverse compression) and total memory usage for the dense-block 
(HSS-D) and compressed-block (HSS-G) inverse compression algorithms. 

We observe a very close match between the experimental scaling and the complexity estimates in 



Proposition 4.2 which we recall on th e sec ond row of Table 5.1 



The slopes in a log-log plot (Figure 5.1 A) show that both inversion and tree build times are 0{N^^^) 
for the dense block version and 0{N) for our accelerated algorithm. The break-even point is around 
N = 10^, for which both methods take about 1.5 minutes. In the speedup plot in 5.1 B, we can clearly 



observe performance gains for N > 10^ due to difference in scaling. By N = 10^ our algorithm gains 
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N 


HSS-D Time 


HSS-C Time 


HSS-D Memory 


HSS-C Memory 




0(7V3/2) 


0{N) 


0{N\ogN) 


0{N) 


784 


0.11 s 


0.17 s 


4.68 MB 


4.48 MB 


3136 


0.67 s 


1.70 s 


29.09 MB 


25.24 MB 


12544 


4.50 s 


8.32 s 


159.59 MB 


123.07 MB 


50176 


31.45 s 


40.43 s 


819.58 MB 


538.51 MB 


200704 


3.79 m 


3.23 m 


3.72 GB 


2.23 GB 


802816 


28.35 m 


13.66 m 


17.27 GB 


9.23 GB 


3211264 


3.58 hr 


54.795 m 


70.99 GB 


34.09 GB 



Table 5.1 



Total inversion time and storage for the NTI inverse compression algorithms 



one order of magnitude speedup. Additional speed gains may be obtained by adjusting the size of the 
fine-scale boxes. 

The slopes of the log-log plot ( Figure [sTljC) confirm that memory usage for these algorithms behaves 



like O(A^logiV) and 0{N), respectively. Our algorithm uses less memory in all cases. (Figure 5.1 D). 
By TV = 10^ it uses 2.5x less memory (150 GB, which amounts to approximately 1800 doubles per 
degree of freedom). 



Matrix compression. Figure 5.1 A, shows that the tree build time is about 18% of the total 
inversion time. The constructed tree can also be used to compress the original matrix A in O(iVlogA^) 
work, and even in 0{N) work by incurring the small additional cost of compressing sibling interaction 
matrices as HSS ID. Even though our focus on building a fast direct solver, this suggests our accelerated 
method to compress interpolation matrices L and R readily yields 0{N) matrix compression. The 
compressed matrix can be used as a kernel-independent FMM, with a significantly simplified algorithm 
structure. 



(A) Log-log plot: tfT\ Tree build and Inverse Compression 




(B) Speedup: HSS-D/HSS-C tn\ Tree build and Inverse Compression 
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(D) Memory Ratio: HSS-D/HSS-C NTI Tree and Inverse 
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Fig. 5.1. NTI 2D Laplace Kernel: Inverse compression timings and memory usage. Total times and storage are 
represented in solid lines, Tree build and storage in dashed lines. 



5.1.2. Translation-invariant kernel. Exploiting translation invariance yields a substantial im- 
provement in performance, which stems from the fact that only one set of matrices per level of 7" needs 
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to be computed and stored. The full-block HSS inverse compression still scales as 0(iV^/^), but is 3x 
faster for all tested problem sizes, and memory usage becomes more efficient as iV grows. The lack 
of improvement in the asymptotic behavior of inverse compression when compared to the the non- 
translation-invariant case stems from the fact that 0{N'^^^) work is performed on the top boxes of the 
tree in both cases. 

As one would expect, our compressed-block algorithm displays sublinear scaling for both inverse 
compression time and storage, for the largest problems gaining an order of magnitude in both when 
compared to the NTI case. For N — 10^, it has become about 11 x faster (29 minutes), using 30 x 
less memory (5 GB). Building the binary tree, which as mentioned amounts to compressing A, displays 
similar performance gains. For N = 10^, it takes only 4.5 minutes to build and storage is only 1.3GB. 



N 


HSS-D Time 


HSS-C Time 


HSS-D Memory 


HSS-C Memory 




0(Ar3/2) 


0{N) 


0{N) 


0{N) 


784 


0.05 s 


0.13 s 


1.94 MB 


1.75 MB 


3136 


0.21 s 


0.98 s 


9.04 MB 


6.19 MB 


12544 


1.40 s 


3.41 s 


39.16 MB 


19.03 MB 


50176 


9.68 s 


10.76 s 


163.19 MB 


52.09 MB 


200704 


1.21 m 


30.89 s 


666.39 MB 


151.41 MB 


802816 


9.20 m 


1.59 m 


2.61 GB 


474.74 MB 


3211264 


1.19 hr 


6.68 m 


9.9359 GB 


1.56 GB 


12845056 


9.28 hr 


29.22 m 


39.74 GB 


5.29 GB 



Table 5.2 

Total inversion time and storage for the TI inverse compression algorithms 



The algorithm achieves parity with the dense-block HSS for lower values of (around N = 50000) , 
for which both methods take about 10 seconds to produce the HSS inverse. By N = 10^, our algorithm 



is about 20 X faster than the dense-block approach (Figure 5.2). We again observe that tree build 
time is consistently about 15%-20% of the total inversion time, and so our observation about matrix 
compression holds. Figure [K2} C shows sublinear scaling for memory usage. It remains true that our 
algorithm uses less memory in all cases. For A'' = 10^, it uses 8x less memory (5 GB, which amounts 
to 50 doubles per degree of freedom). 

5.1.3. Inverse matrix-vector multiplication. In order to test the inverse matvec algorithm, 
we run it with ten random right-hand sides and compute time per solve (in seconds) for both NTI and 
TI examples. We note that, given that the same matrix blocks are applied to the same subsets of each 
right-hand side, code can be easily vectorized leading to performance gains if multiple solves are to be 
performed simultaneously. If our matrix is translation-invariant additional speedup result from the fact 
that the same set of matrices is applied to vectors corresponding to all boxes in a given level. 

We note that for all cases, the inverse apply is very fast, scaling linearly and remaining well under 
a minute for sizes up to A^ ~ 10^. As expected, the differences between the standard and compressed 
block apply are small, the latter becoming marginally faster around A^ ~ 3 x 10^. 

For all cases, if ^ and are our compressed, approximate matrix and inverse, we use the following 
error measure: 

\\v-AA-'v\\ 
M 

Taking the maximum over a number of randomly generated right-hand-sides v. In this measure, 
which can be thought of as an approximate residual, both algorithms achieve the desired target accuracy 
for the inverse apply. We may then bound the exact residual by: 



\\v-AA-^v\\ <E + \\A-A\\\\A-^v 
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(A) Log-log plot: Tl Tree build and Inverse Compression 



(B) SpeedUp: HSS-D/HSS-C Tl Tree build and Inverse Compression 
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(C) Log-log plot: Tl Tree and Inverse Memory Usage 
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(D) Memory Ratio: HSS-D/HSS-C Tl Tree and Inverse 
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Fig. 5.2. Tl 2D Laplace kernel: Inverse compression timings and memory usage. Total times and storage are 
represented in solid lines, Tree build and storage in dashed lines. 



N 


NTI HSS-D 


NTI HSS-C 


Tl HSS-D 


Tl HSS-C 


NTI HSS-C 


Tl HSS-C 




0{N log N) 


OiN) 


O(A^logiV) 


0{N) 


Error 


Error 


784 


0.0014 


0.0018 


0.0007 


0.0011 


1.6e-14 


9.2e-15 


3136 


0.0064 


0.0090 


0.0031 


0.0046 


1.8e-14 


1.8e-14 


12544 


0.0292 


0.0362 


0.0137 


0.0162 


8.6e-ll 


5.7e-ll 


50176 


0.1320 


0.1546 


0.0590 


0.0600 


1.6C-10 


1.7e-10 


200704 


0.5993 


0.6772 


0.2819 


0.2512 


2.3e-10 


1.6e-10 


802816 


2.6611 


2.8193 


1.2709 


1.0763 


4.0e-10 


3.8e-10 


3211264 


11.816 


11.737 


5.77296 


4.5650 


5.1e-9 


1.6e-9 


12845056 


52.468 


48.8641 


25.8312 


19.3619 




3.2e-9 



Table .5.3 

Inverse apply timings (in seconds) for both NTI and Tl algorithms. Numbers in red are extrapolated from previous 



data. 



5.2. The eflfect of varying accuracy. If we set a lower target accuracy, this can significantly 
speed up both algorithms presented above: the number of boundary layers needed, the size of skeleton 
sets in T, ranks in low-rank and HSS ID blocks in the compressed-block algorithm are all lower, and 
low-rank factorization can be performed faster. 

In this set of tests, using the Laplace single layer potential in 2D, we compare their behavior for 
target accuracies e = 10~^ and 10^^°. 

Dense-block algorithm. In this case, only the change in skeleton set size is relevant. As in 
Section |4j we can bound ki < = C^'n'J'^ . In particular, e = 10~^ requires one boundary layer 
and £ — 10"^" requires two, and so Cig-io ^ 2Cio-5, as expected from standard multipole estimates 
(Ce ^ log(l/e)). Following our complexity analysis, we observe both tree build and inverse compres- 
sion algorithms perform 0{kf) work per box, and so the constant in the leading term is of the form 
C'log'^(l/e). This would imply a factor of 8 between e = 10^^ and e = 10^^". Analogously, inverse 
matvec work and memory usage are 0{k^) per box, and so we expect a factor of 4 between these two 
cases. 
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Compressed-block algorithm. Performance of the compressed block algorithm depends on all 

throe quantities mentioned above: the size of 2D HSS skeletons is again 0{log{nf /e)), but fast operations 
are performed on these. Hence, we expect a factor of log(l/e) to appear, at worst. Assuming sizes 
of dense blocks in low-rank and HSS ID representations are 0{\og{Tn/e)), the most expensive one- 
dimensional dense-block HSS perations are linear with a constant 0(log^(l/e)), and again matvecs and 
storage get a factor 0(log^(l/e)). 

Hence, we expect the factors to behave as 0(log^(l/e)) (16x) for tree build and inverse compression, 
and as 0(log^(l/£:)) (8x) for inverse apply and memory storage. 



N 


HSS-D Time 


Ratio 


HSS-C Time 


Ratio 


HSS-D Memory 


Ratio 


HSS-C Memory 


Ratio 


784 


0.01 s 


3.1 


0.03 s 


6.7 


0.45 MB 


3.3 


0.44 MB 


4.1 


3136 


0.03 s 


5.2 


0.07 s 


15 


1.92 MB 


3.7 


1.75 MB 


4.2 


12544 


0.20 s 


6.4 


0.75 s 


6 


7.98 MB 


3.8 


5.40 MB 


4.3 


50176 


1.35 s 


6.9 


2.83 s 


6 


32.53 MB 


3.9 


13.67 MB 


4.4 


200704 


9.77 m 


7.4 


7.87 s 


6.4 


131.39 MB 


4 


36.59 MB 


4.4 


802816 


1.22 m 


7.6 


20.88 s 


7.3 


528.19 GB 


4 


107.38 MB 


4.4 


3211264 


9.13 m 


7.9 


1.19 m 


8 


2.07 GB 


4 


349.50 MB 


4.2 


12845056 


1.14 hr 


8.1 


4.56 m 


8.9 


8.34 GB 


4 


1.21 GB 


3.8 



Table 5.4 

Inverse compression time and storage of the TI algorithms for e = 10~^, and the ratio of the quantities corresponding 
to 10"^" and 10~^ target accuracy 



Experimental results show that our estimates for the dense block algorithm (HSS-D) are close: as 

N grows, we observe that the ratio between compression times converges to 8, and that for memory 
usage to 4, as expected. However, although there more variation in the compressed-block results, we 
generally observe our estimates to be conservative, since the ratios are fairly similar to the dense-block 
case. 

Our hypotheses also appear to be conservative for the inverse apply algorithms: ratios between 
these two accuracies tend to be between 2 and 3 for all cases tested. 

We omit the results for the non-translation-invariant case, since the comparison and ratios between 
both target accuracies are quite similar. Overall, this implies that the analysis and observations in 
Section 5.1 can be applied with little modification to the case s = 10~^: for both algorithms, a faster 
but less accurate inverse can be compressed 8 times faster, stored using 4 times less memory, and can 
be applied two or three times faster than the high accuracy case. 

5.3. Low Frequency Oscillatory Kernels. Finally, we perform the same set of experiments as 
in Section 5.1, but this time taking /C(r) in our general formulation to be the 2D Helmholtz free-space 
Green's function for wave number k: 

(5.1) Qkir) = -'-H^ikr) 

where Hq is the Hankel function of the first kind. As we will see in Section 5.4, this is akin to 
solving the Lippmann-Schwinger equation for a given frequency k. 

As mentioned before, with some minor modifications our solver is able to handle oscillatory kernels 
for low frequencies. We briefly note the main differences with non-oscillatory problems: 

• Skeleton Structure: For high accuracy, skeletons with consisting of more layers of points 
are needed in comparison with non-oscillatory kernels such as Laplace. Besides this, as the 
wavenumber grows it is also necessary to keep points inside of the box in the skeleton set. 
Experimentally we determine that keeping several points per wavelength is enough to maintain 
accuracy for matrix and inverse compression. 

• Conditioning: Tlie Lippmann-Schwinger equation is moderately ill-conditioned, with condi- 
tion number depending on k. This impacts the matrix inversions in both levels of compression 
resulting several digits of accuracy. We note that this is mild for the wavenumbers tested, and 
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Fig. 5.3. 2D Helmholtz kernel skeleton set structure: Skeleton points are in black and residual points in blue. A few 
points per wavelength are kept on the interface between the two children boxes. 



that we can always gain them back by one round of iterative refinement (matrix and inverse 
apphes remain quite fast). 
• Rank growth assumptions: Ultimately, as k grows, the assumptions stated in Section 4 will 
fail: The size of skeletons in compressed blocks will depend on fc, as well as the number of 
points needed in our discretization. 
This points to the fact that a different approach is needed to build a fast direct solver that can 
handle moderate and high frequency regimes. 

5.3.1. Inverse compression results: TI and NTI cases. We present experimental results for 
both the non-translation-invariant and the translation- invariant cases. Let k = k/2iT he the number of 
wavelengths that fit in our domain, the unit box. For k = 4, 8, 16, 32, we set a leaf box size at Umax = 9^ 
and test for increasing problem size N. Here we compare total inversion time and memory usage for 
the HSS-D and HSS-C inverse compression algorithms. 

We observe that, while the cases for k = 4, 8 behave similarly to their Laplace counterpart (skeletons 
consist of 2 boundary layers), for k = 16,32 and higher at least 3 layers are needed, and accuracy in 
matrix compression deteriorates unless we also keep a few extra points per wavelength inside the box. 
As we will see, this change in skeleton set structure is the main cause for differences in behavior between 
these two sets of examples. 

We note that, given that the Helmholtz kernel is complex- valued, two doubles are stored for each 
matrix entry. This implies that memory usage can be a priori expected to be at least 2 times bigger 
than that of real-valued kernels. 

We first observe that for all cases, experimental scaling again coincides with the expected complexity 
(the 0(7V3/2) HSS-D algorithm is not plotted in figure [s!!! to avoid clutter). For the HSS-C algorithm, 
slopes in|5.4|A quickly approach to 1 as N grows. 



Break-even points and speedup: on the speedup plot in figure 5.4 B, wc can observe that the 
point where the compressed-block algorithm overcomes its dense-block counterpart grows slightly with 
K. While it is around 10^ for k = 4, 8, it grows closer to 10^ for k — 16, 32. We can also see a moderate 
speedup is gained after these break-even points due to better scaling, similarly to the case for Laplace. 

Memory Usage: In figures [5^ C and |5.4| D we again can observe the HSS-C algorithm always 
provides extra compression in terms of storage, and this improves as N grows. By N ^ 10^, it uses 
3 — 4 times less memory than HSS-D. 

For N ~ 10^ and k = 4, 8, the HSS-C algorithm takes about 1.3 hours to produce the HSS inverse, 
requiring 27GB of storage. For k = 16, 32, it takes 7 and 10 hours to build the inverse, respectively, and 
storage requirements go up to ^ 60GB. Comparing its performance with the Laplace NTI kernel in 
Section 5.1 (which requires 0.5 hours and 10 GB for this size), we see that as the wave number increases, 
it takes considerably more time and storage to compress the inverse. The most dramatic change can be 
observed for k — 16, 32, since skeleton sets become much bigger in size. 

Exploiting translation invariance yields significant performance gains for all tested wavenumbers. 
The HSS-D algorithm is again 0(7V^/^), and about 3x faster than the NTI case. We also observe 
sublinear scaling for inverse compression and storage for the HSS-C version, although a bit less dramatic 
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(A) Log-log plot: Nn Inverse Compression for varying K 



(B) Speedup HSS-D/HSS-C 




4^ 
Log(N) 

Fig. 5.4. NTI 2D Helmholtz kernel: Inverse compression timings and memory usage for increasing values of k 

(A) Log-log plot: Tl Inverse Compression for varyingK ^ (g) Speed Up HSS-D/HSS-C 




LQg(N) 

Fig. 5.5. TI 2D Helmholtz kernel: Inverse compression timings and memory usage for increasing values of k 



than in the case for the Laplace kerncL By N ^ 10^, it has become 5 — 6x faster, using 15x less memory. 

Since only one set of matrices is computed for each level in T, we can observe more clearly the 
impact of adding an extra layer of skeleton points: inverse compression time and storage become about 
1 order of magnitude higher. Also, there is a more pronounced difference between k — 16 and k = 32 
for bigger problem sizes. 

Break-even points and speedup: On the speedup plot in figure [53] B, we see that break-even 
points are a bit smaller when compared with the NTI case, and that they again grow with k. For k = 4 
it is below 10^, for k = 32, it happens around 3 x 10^. Speedups are again considerably better and 
faster growing due to sublinear scaling, especially for small wavenumbers. 

Memory Usage: The gain in compression is more rapid, especially for low wavenumbers. By 
N ^ 10^, it uses 3 — 5 times less memory than HSS-D. Also, in 5.5 D, we see the slopes for the memory 
ratio are higher than in the NTI case. 
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For N ^ 10^ and k = 4, 8, the HSS-C algorithm takes about 15 mmutcs to produce the HSS inverse, 
requiring 1.5GB of storage. For k — 16, 32, these go up to 1.2 and 2.6 hours to build the inverse, taking 
4.6 and 5GB of storage . Comparing its performance with the Laplace TI kernel (2 minutes and 0.5 
GB for this size), we observe similar differences in storage as in the NTI case, but much more drastic 
ones in terms of inverse compression time. 

5.3.2. Inverse matrix-vector multiplication. Finally, we test the inverse apply algorithm as 
in Section 5.1.3. Since the differences between dense and compressed block are quite small, we present 
results for the HSS-C apply only, for k — 8 and k = 32. The inverse apply is still considerably fast and 
retains optimal scaling. 



N 


NTI HSS-C 


NTI HSS-C 


TI HSS-C 


TI HSS-C 


TI HSS-C 


TI HSS-C 




K = 8 


K = 32 


K = 8 


K = 32 


Error (k = 8) 


Error (k = 32) 


1296 


0.0036 


0.0074 


0.0022 


0.0048 


le-13 


le-12 


5184 


0.0205 


0.0596 


0.0126 


0.0365 


5.5e-10 


2.6e-8 


20736 


0.1102 


0.3046 


0.0587 


0.1663 


1.6e-10 


8.7e-9 


82944 


0.5197 


1.4962 


0.2465 


0.8148 


2.3e-9 


3.6e-8 


331776 


2.5431 


6.7812 


1.0350 


3.4867 


l.Oe-10 


2.0e-8 


1327104 


10.1684 


27.1248 


4.3536 


15.4991 


1.2e-9 


4.9e-8 



Table 5.5 

Inverse apply timings (in seconds) for both NTI and TI algorithms, for different values of k. Numbers in red are 
extrapolated from previous data. 



For K — 16,32, we usually lose two digits in our accuracy measure (~ 10~^), when compared to 
our target 10^ As was mentioned above, this is an effect of conditioning and can be addressed if 
necessary by cranking up accuracy or by one round of iterative refinement. 

5.4. 2D scattering problem: Lippmann-Schwinger equation. 

Background on scattering. Consider an acoustic scattering problem in involving a "soft" 
scatterer contained in a domain Cl. For simplicity, we assume that the "incoming field" is generated 
from a point source at the points (z fl'^. A typical mathematical model would take the form: 

^2 

(5.2) - Au{x) —u{x)^6{x-Xs), xeR^ 

v[xY 

where n is the frequency of the incoming wave, and where v{x) is the wave-speed at x. We assume 
that the wave-speed is constant outside of fi, so that v{x) = vq for x € fl'^. To make the equation 
well-posed, we assume that u{x) satisfies the natural "radiation condition" at infinity. We define the 
"wave number" k via 

and a function b = b{x) that quantifies the "deviation" in the wave-number from the free-space wave- 
number via 



b{x) 



vix)"^ 



Observe that b{x) = outside Vt. Equation (5.2) then takes the form 
(5.3) - Au{x) - k'^u{x) + b{x)u{x) = 5{x - Xs), a; e 

Finally, let Gk denote the Helmholtz fundamental solution as in equation |5.H and split the solution u 



of (5.3 1 into "incoming" and "outgoing" fields so that u — Mj„ -I- itout, where 

Uin{x) = Gkix,Xs). 
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Then Uout satisfies: 

(5.4) - Auoutix) - k'^Uoutix) + b{x)uout{x) = -b{x)uinix), xefl. 



When converting (5.4) as an integral equation, we assume that b{x) is non-negative, and look for a 
solution of the form 



(5.5) 



Uoutix) = [SkiVbT)]{x) ^ / Gkix,y)^/b{y)T{y)dy. 



Inserting (5.5) into (5.4), and dividing the result by y^b(x), we find the equation 
(5.6) 



t{x)+ I \/b{x) Gk{x,y) y/b{y)T{y)dy = - b{x) Uin{x) , x S 

Test problem. For our numerical tests, we take: 

b{xi,X2) =0.25fc2(l + tanhL>(l-e - |a;i|))(l + tanhL>(l - e - |a;2|)) 

Note that b is equal to 1 in a subdomain of [—1, 1]^ and transitions exponentially to close to the 
boundary of the square. Parameters D and e can be manipulated to change this layer's position and 
width. We note that if b close to up to machine precision, the tree build routine must be modified to 
pick skeleton points along the support of b. 




Fig. 5.6. Plot of deviation b for the scatterer in our Lippman-Schwinger equation 



We implement an 0{h ) accurate corrected trapezoidal rule 14 by adding a diagonal correction to 



our kernel. Since our algorithm compresses off-diagonal blocks, this has no bearing on the performance 
of our direct solver. We note that higher order (up to 0{h^^)) quadratures of this form can be applied 
if needed. 

Numerical results. For k = 4, 8 and increasing problem size A^, we solve the Lippmann-Schwinger 



equation (5.6) using our solver for non-translation invariant, symmetric operators. 



For both for the approximate matrix A and its inverse, we first measure the empirical order of 
convergence when applying them to the same right hand side while refining grid size h. We confirm 
that the order is approximately 0{h'^) = 0{N^'^) until the error is comparable to the target accuracy. 

Since there is a much higher contrast in the kernel entries, and they get close to at rows and 
columns corresponding to the box boundary, through preliminary experiments we observe that 2 layers 
yield moderate accuracy (^ 10^^), and 3 are needed for high accuracy (^ 10^^"). We note that for 
cases where there is small variation within the domain of interest VL, this approach could be optimized 
by giving special treatment to boxes which intersect its boundary. 

Also, we observe that the inverse of the translation-invariant operator in Section 5.3 may be used 
as a right preconditioner for this problem. Using BiCGstab, we find that for a given wave number k, 
the number of iterations to solve the preconditioned system is moderate and independent of problem 
size N . 



We test the direct solver and the preconditioned iterative solver proposed above (Table 5.6) For 



the latter, the compression step includes tree-build (matrix compression) for the Lippmann-Schwinger 
kernel, and inverse compression for the corresponding translation-invariant Helmholtz kernel (our pre- 
conditioner) . 
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N 


HSS-D Time 


HSS-C Time 


Iter Time 


HSS-D Memory 


HSS-C Memory 


Iter Memory 


784 


1.92 s 


1.86 s 


1.72 s 


18.74 MB 


18.74 MB 


9.63 MB 


3136 


11.81 s 


24.93 s 


9.30 s 


125.87 MB 


92.30 MB 


54.37 MB 


12544 


1.31 m 


2.68 m 


1.13 m 


720.90 MB 


464.81 MB 


258.97 MB 


50176 


8.03 m 


14.20 m 


5.65 m 


3.69 GB 


5.56 GB 


1.19 GB 


200704 


47.13 m 


1.17 h 


25.71 m 


18.34 GB 


13.22 GB 


5.53 GB 


802816 


4.78 h 


5.70 h 


1.93 h 


90.36 GB 


47.16 GB 


25.38 GB 



Table 5.6 

Total inverse compression time and storage for the Lippmann-Schwinger eguation for fc = 8 and e = 10"^" target 
accuracy 



N 


HSS-D Solve 


HSS-C Solve 


HSS Error 


Iter Solve 


Iter # 


Error 


784 


0.03 s 


0.02 s 


2.7e-ll 


0.59 s 


20 


8.0e-9 


3136 


0.08 s 


0.17 s 


5.2e-9 


3.13 s 


30 


9.2e-9 


12544 


0.36 s 


0.82 s 


7.1e-8 


14.34 s 


28 


4.4e-9 


50176 


1.74 s 


3.73 s 


3.6e-ll 


56.12 s 


28 


3.6e-9 


200704 


9.54 s 


20.21 s 


9.3e-ll 


236.23 s 


29.5 


7.7e-9 


802816 


52.50 s 


109.32 s 


1.2e-10 


1010.1 s 


29.5 


4.4e-9 



Table 5.7 

Inverse apply timings for the Lippmann-Schwinger equation for k = 8 and e = 10""'"'' target accuracy 



Finally, on Table 5.7 we compare inverse apply (solve) stage for both algorithms and for the iterative 
approach. 

We note that for the HSS-C algorithm, one round of adaptive refinement is needed in order to 
attain the target accuracy for the approximate residual. This means two inverse and one matrix applies 
are used in the solve. 

6. Conclusions and Future Work. We have described a direct solver for volume integral equa- 
tions in the plane that attains optimal 0{N) complexity and high practical efficiency for problems 
with non-oscillatory (or moderately oscillatory) kernels. The solver displays high performance for large 
problem sizes, even for high target accuracies. It gains a significant additional advantage when dealing 
with translation invariant kernels, since only one set of structured matrices needs to be computed for 
each level of the tree, resulting in sublinear scaling of inverse compression time and storage. 



The solver is based on the recursive skeletonization scheme described in 31 , which would have 
0{N^^^) complexity if applied to the volume integral equations considered here. We attained the 
acceleration to optimal 0{N) complexity by using structured matrix algebra to manipulate certain 
large dense matrices that arise in the computation. Specifically, we used the so called Hierarchically 
Semi- Separable (HSS), format which exploits rank deficiencies in the off-diagonal blocks of the matrix. 
As noted in Section 5, the accelerated direct solver provides better compression than the dense-block 
algorithm of 31 in all tested environments, and it becomes faster for sizes greater than N ~ 10^. 



Because of optimal scaling, this implies a one order of magnitude speedup for every increase in two 
orders in problem size TV. 

We also note that the increase in memory efficiency of our algorithm is of significant importance: 
given that all structured-matrix and FMM-like methods are memory intensive, the lack of sufficient 
compression can render them prohibitive for large enough problems. 

The fast tree build routine described can be modified into an 0{N) matrix compression algorithm 
which is competitive with the FMM and may be used with iterative methods such as GMRES, see 
Section 5. 

Future work directions.. There are several directions we plan to explore: 
• Extension to surfaces in 3D: we expect that the compressed-block HSS algorithms presented 
in this paper can be ported with no major structural differences to the case of solving boundary 
integral equations on surfaces in 3D, and our ongoing work features this extension. The main 
challenges will most likely come from a potentially more complex skeleton set structure and 
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from implementation considerations in more complex geometries. 

Having such fast direct solvers available is of most relevance to solve large scale boundary 
value and evolution PDE problems, with applications such as fluid-structure interaction and 
3D scattering problems. 

• Parallel implementation: as we have noted, the algorithms and implementation presented 
are serial. However, our direct solver is ideally suited for parallel implementation: intensive 
linear algebra operations are performed for each node on a binary tree, and the most expensive 
computations (inverse compression) require only one upward pass. 

• Direct solvers for oscillatory problems: ultimately, fast direct solvers such as this are 
needed to tackle oscillatory problems arising from accoustics and electromagnetics (e.g. Helmholtz 
equation and Maxwell equations) in moderate and high frequency regimes. 

Acknowledgments:. We would like to thank Mark Tygert and Leslie Greengard's group for many 
helpful insights during the design and implementation of our algorithm. 
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